Goto

Collaborating Authors

 Maricopa




From Stoplights to On-Ramps: A Comprehensive Set of Crash Rate Benchmarks for Freeway and Surface Street ADS Evaluation

Scanlon, John M., McMurry, Timothy L, Chen, Yin-Hsiu, Kusano, Kristofer D., Victor, Trent

arXiv.org Artificial Intelligence

This paper presents crash rate benchmarks for evaluating US-based Automated Driving Systems (ADS) for multiple urban areas. The purpose of this study was to extend prior benchmarks focused only on surface streets to additionally capture freeway crash risk for future ADS safety performance assessments. Using publicly available police-reported crash and vehicle miles traveled (VMT) data, the methodology details the isolation of in-transport passenger vehicles, road type classification, and crash typology. Key findings revealed that freeway crash rates exhibit large geographic dependence variations with any-injury-reported crash rates being nearly 3.5 times higher in Atlanta (2.4 IPMM; the highest) when compared to Phoenix (0.7 IPMM; the lowest). The results show the critical need for location-specific benchmarks to avoid biased safety evaluations and provide insights into the vehicle miles traveled (VMT) required to achieve statistical significance for various safety impact levels. The distribution of crash types depended on the outcome severity level. Higher severity outcomes (e.g., fatal crashes) had a larger proportion of single-vehicle, vulnerable road users (VRU), and opposite-direction collisions compared to lower severity (police-reported) crashes. Given heterogeneity in crash types by severity, performance in low-severity scenarios may not be predictive of high-severity outcomes. These benchmarks are additionally used to quantify at the required mileage to show statistically significant deviations from human performance. This is the first paper to generate freeway-specific benchmarks for ADS evaluation and provides a foundational framework for future ADS benchmarking by evaluators and developers.


Forecasting Coccidioidomycosis (Valley Fever) in Arizona: A Graph Neural Network Approach

Sarabi, Ali, Sarabi, Arash, Yan, Hao, Sterner, Beckett, Jevtić, Petar

arXiv.org Artificial Intelligence

Coccidioidomycosis, commonly known as Valley Fever, remains a significant public health concern in endemic regions of the southwestern United States. This study develops the first graph neural network (GNN) model for forecasting Valley Fever incidence in Arizona. The model integrates surveillance case data with environmental predictors using graph structures, including soil conditions, atmospheric variables, agricultural indicators, and air quality metrics. Our approach explores correlation-based relationships among variables influencing disease transmission. The model captures critical delays in disease progression through lagged effects, enhancing its capacity to reflect complex temporal dependencies in disease ecology. Results demonstrate that the GNN architecture effectively models Valley Fever trends and provides insights into key environmental drivers of disease incidence. These findings can inform early warning systems and guide resource allocation for disease prevention efforts in high-risk areas.


RelationalFactQA: A Benchmark for Evaluating Tabular Fact Retrieval from Large Language Models

Satriani, Dario, Veltri, Enzo, Santoro, Donatello, Papotti, Paolo

arXiv.org Artificial Intelligence

Factuality in Large Language Models (LLMs) is a persistent challenge. Current benchmarks often assess short factual answers, overlooking the critical ability to generate structured, multi-record tabular outputs from parametric knowledge. We demonstrate that this relational fact retrieval is substantially more difficult than isolated point-wise queries, even when individual facts are known to the model, exposing distinct failure modes sensitive to output dimensionality (e.g., number of attributes or records). To systematically evaluate this under-explored capability, we introduce RelationalFactQA, a new benchmark featuring diverse natural language questions (paired with SQL) and gold-standard tabular answers, specifically designed to assess knowledge retrieval in a structured format. RelationalFactQA enables analysis across varying query complexities, output sizes, and data characteristics. Our experiments reveal that even state-of-the-art LLMs struggle significantly, not exceeding 25% factual accuracy in generating relational outputs, with performance notably degrading as output dimensionality increases. These findings underscore critical limitations in current LLMs' ability to synthesize structured factual knowledge and establish RelationalFactQA as a crucial resource for measuring future progress in LLM factuality.


SatFlow: Generative model based framework for producing High Resolution Gap Free Remote Sensing Imagery

Irigireddy, Bharath, Bandaru, Varaprasad

arXiv.org Artificial Intelligence

Frequent, high-resolution remote sensing imagery is crucial for agricultural and environmental monitoring. Satellites from the Landsat collection offer detailed imagery at 30m resolution but with lower temporal frequency, whereas missions like MODIS and VIIRS provide daily coverage at coarser resolutions. Clouds and cloud shadows contaminate about 55\% of the optical remote sensing observations, posing additional challenges. To address these challenges, we present SatFlow, a generative model-based framework that fuses low-resolution MODIS imagery and Landsat observations to produce frequent, high-resolution, gap-free surface reflectance imagery. Our model, trained via Conditional Flow Matching, demonstrates better performance in generating imagery with preserved structural and spectral integrity. Cloud imputation is treated as an image inpainting task, where the model reconstructs cloud-contaminated pixels and fills gaps caused by scan lines during inference by leveraging the learned generative processes. Experimental results demonstrate the capability of our approach in reliably imputing cloud-covered regions. This capability is crucial for downstream applications such as crop phenology tracking, environmental change detection etc.,


Dynamic Benchmarks: Spatial and Temporal Alignment for ADS Performance Evaluation

Chen, Yin-Hsiu, Scanlon, John M., Kusano, Kristofer D., McMurry, Timothy L., Victor, Trent

arXiv.org Artificial Intelligence

Deployed SAE level 4+ Automated Driving Systems (ADS) without a human driver are currently operational ride-hailing fleets on surface streets in the United States. This current use case and future applications of this technology will determine where and when the fleets operate, potentially resulting in a divergence from the distribution of driving of some human benchmark population within a given locality. Existing benchmarks for evaluating ADS performance have only done county-level geographical matching of the ADS and benchmark driving exposure in crash rates. This study presents a novel methodology for constructing dynamic human benchmarks that adjust for spatial and temporal variations in driving distribution between an ADS and the overall human driven fleet. Dynamic benchmarks were generated using human police-reported crash data, human vehicle miles traveled (VMT) data, and over 20 million miles of Waymo's rider-only (RO) operational data accumulated across three US counties. The spatial adjustment revealed significant differences across various severity levels in adjusted crash rates compared to unadjusted benchmarks with these differences ranging from 10% to 47% higher in San Francisco, 12% to 20% higher in Maricopa, and 7% lower to 34% higher in Los Angeles counties. The time-of-day adjustment in San Francisco, limited to this region due to data availability, resulted in adjusted crash rates 2% lower to 16% higher than unadjusted rates, depending on severity level. The findings underscore the importance of adjusting for spatial and temporal confounders in benchmarking analysis, which ultimately contributes to a more equitable benchmark for ADS performance evaluations.


RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models

Ge, Junyao, Zheng, Yang, Guo, Kaitai, Liang, Jimin

arXiv.org Artificial Intelligence

Abundant, well-annotated multimodal data in remote sensing are pivotal for aligning complex visual remote sensing (RS) scenes with human language, enabling the development of specialized vision language models across diverse RS interpretation tasks. However, annotating RS images with rich linguistic semantics at scale demands expertise in RS and substantial human labor, making it costly and often impractical. In this study, we propose a workflow that leverages large language models (LLMs) to generate multimodal datasets with semantically rich captions at scale from plain OpenStreetMap (OSM) data for images sourced from the Google Earth Engine (GEE) platform. This approach facilitates the generation of paired remote sensing data and can be readily scaled up using openly available data. Within this framework, we present RSTeller, a multimodal dataset comprising over 1 million RS images, each accompanied by multiple descriptive captions. Extensive experiments demonstrate that RSTeller enhances the performance of multiple existing vision language models for RS scene understanding through continual pre-training. Our methodology significantly reduces the manual effort and expertise needed for annotating remote sensing imagery while democratizing access to high-quality annotated data. This advancement fosters progress in visual language modeling and encourages broader participation in remote sensing research and applications. The RSTeller dataset is available at https://github.com/SlytherinGe/RSTeller.


Benchmarks for Retrospective Automated Driving System Crash Rate Analysis Using Police-Reported Crash Data

Scanlon, John M., Kusano, Kristofer D., Fraade-Blanar, Laura A., McMurry, Timothy L., Chen, Yin-Hsiu, Victor, Trent

arXiv.org Artificial Intelligence

With fully automated driving systems (ADS; SAE level 4) ride-hailing services expanding in the US, we are now approaching an inflection point, where the process of retrospectively evaluating ADS safety impact can start to yield statistically credible conclusions. An ADS safety impact measurement requires a comparison to a "benchmark" crash rate. This study aims to address, update, and extend the existing literature by leveraging police-reported crashes to generate human crash rates for multiple geographic areas with current ADS deployments. All of the data leveraged is publicly accessible, and the benchmark determination methodology is intended to be repeatable and transparent. Generating a benchmark that is comparable to ADS crash data is associated with certain challenges, including data selection, handling underreporting and reporting thresholds, identifying the population of drivers and vehicles to compare against, choosing an appropriate severity level to assess, and matching crash and mileage exposure data. Consequently, we identify essential steps when generating benchmarks, and present our analyses amongst a backdrop of existing ADS benchmark literature. One analysis presented is the usage of established underreporting correction methodology to publicly available human driver police-reported data to improve comparability to publicly available ADS crash data. We also identify important dependencies in controlling for geographic region, road type, and vehicle type, and show how failing to control for these features can bias results. This body of work aims to contribute to the ability of the community - researchers, regulators, industry, and experts - to reach consensus on how to estimate accurate benchmarks.


Clustering US Counties to Find Patterns Related to the COVID-19 Pandemic

Brown, Cora, Milstein, Sarah, Sun, Tianyi, Zhao, Cooper

arXiv.org Artificial Intelligence

When COVID-19 first started spreading and quarantine was implemented, the Society for Industrial and Applied Mathematics (SIAM) Student Chapter at the University of Minnesota-Twin Cities began a collaboration with Ecolab to use our skills as data scientists and mathematicians to extract useful insights from relevant data relating to the pandemic. This collaboration consisted of multiple groups working on different projects. In this write-up we focus on using clustering techniques to help us find groups of similar counties in the US and use that to help us understand the pandemic. Our team for this project consisted of University of Minnesota students Cora Brown, Sarah Milstein, Tianyi Sun, and Cooper Zhao, with help from Ecolab Data Scientist Jimmy Broomfield and University of Minnesota student Skye Ke. In the sections below we describe all of the work done for this project. In Section 2, we list the data we gathered, as well as the feature engineering we performed. In Section 3, we describe the metrics we used for evaluating our models. In Section 4, we explain the methods we used for interpreting the results of our various clustering approaches. In Section 5, we describe the different clustering methods we implemented. In Section 6, we present the results of our clustering techniques and provide relevant interpretation. Finally, in Section 7, we provide some concluding remarks comparing the different clustering methods.